With schools and daycares closed because of COVID-19, many academics are currently working from home with their kids underfoot. Writing in Nature, Minello (https://www.nature.com/articles/d41586-020-01135-9) suggested the pandemic is disproportionately affecting the productivity of female academics, because women often do more caregiving than men.
I quantified this effect by analyzing data on preprint submissions to arXiv (https://arxiv.org/) and bioRxiv (https://www.biorxiv.org/), two preprint servers that together cover many STEM fields. Peer review takes time, so it is still too soon to see COVID-19’s effects on the numbers of journal articles published by female versus male academics. However, a growing number of academics make their submitted or in-progress manuscripts available on preprint servers, meaning it might be possible to measure the pandemic’s effect on preprint submissions in real time.
First, I scraped submission data from arXiv, a preprint server for physics, math, computer science, statistics, and other quantitative disciplines. I used the aRxiv package to scrape the data, see:
Ram, K. and K. Broman (2019). aRxiv: Interface to the arXiv API. R package version 0.5.19. https://CRAN.R-project.org/package=aRxiv
I began by scraping all records for March 15-April 15, 2020, during the COVID-19 pandemic, and for the same date range in 2019. I then expanded to the same dates in 2018. Finally, I scraped all the data for Jan. 1-March 15, 2020, immediately before the pandemic, and updated the pandemic data with the most recent dates (April 16-22, 2020). I scraped the data in batches, as recommended in the aRxiv package tutorial.
#Not run
library(aRxiv) #Provides arxiv_count() and arxiv_search()
#Get all submissions between March 15, 2020 and April 15, 2020 (during the COVID-19 pandemic)
n.2020 <- arxiv_count(query = 'submittedDate:[202003150000 TO 202004152400]')
n.2020.1 <- arxiv_count(query = 'submittedDate:[202003150000 TO 202003212400]') #Check number in date range
df.2020.1 <- arxiv_search(query = 'submittedDate:[202003150000 TO 202003212400]', limit=n.2020, batchsize=1000)
n.2020.1-length(df.2020.1$id) #Check that the right number of records were returned
n.2020.2 <- arxiv_count(query = 'submittedDate:[202003220000 TO 202003282400]')
df.2020.2 <- arxiv_search(query = 'submittedDate:[202003220000 TO 202003282400]', limit=n.2020, batchsize=1000)
n.2020.2-length(df.2020.2$id)
n.2020.3 <- arxiv_count(query = 'submittedDate:[202003290000 TO 202004032400]')
df.2020.3 <- arxiv_search(query = 'submittedDate:[202003290000 TO 202004032400]', limit=n.2020, batchsize=1000)
n.2020.3-length(df.2020.3$id)
n.2020.4 <- arxiv_count(query = 'submittedDate:[202004040000 TO 202004092400]')
df.2020.4 <- arxiv_search(query = 'submittedDate:[202004040000 TO 202004092400]', limit=n.2020, batchsize=1000)
n.2020.4-length(df.2020.4$id)
n.2020.5 <- arxiv_count(query = 'submittedDate:[202004100000 TO 202004121415]')
df.2020.5 <- arxiv_search(query = 'submittedDate:[202004100000 TO 202004121415]', limit=n.2020, batchsize=2000)
n.2020.5-length(df.2020.5$id)
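#Note: this batch starts at 14:20, five minutes after the previous batch ends (14:15); the check against n.2020 below would reveal any records missed in that gap
n.2020.6 <- arxiv_count(query = 'submittedDate:[202004121420 TO 202004152400]')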
df.2020.6 <- arxiv_search(query = 'submittedDate:[202004121420 TO 202004152400]', limit=n.2020, batchsize=1000)
n.2020.6-length(df.2020.6$id)
df.2020.full <- rbind(df.2020.1, df.2020.2, df.2020.3, df.2020.4, df.2020.5, df.2020.6)
n.2020-length(df.2020.full$id)
write.csv(df.2020.full, file="Data/arxiv_2020_data.csv")
#Get all submissions between March 15, 2019 and April 15, 2019 (the same dates last year)
n.2019 <- arxiv_count(query = 'submittedDate:[201903150000 TO 201904152400]')
n.2019.1 <- arxiv_count(query = 'submittedDate:[201903150000 TO 201903222400]')
df.2019.1 <- arxiv_search(query = 'submittedDate:[201903150000 TO 201903222400]', limit=n.2019, batchsize=2000)
n.2019.1-length(df.2019.1$id)
n.2019.2 <- arxiv_count(query = 'submittedDate:[201903230000 TO 201903292400]')
df.2019.2 <- arxiv_search(query = 'submittedDate:[201903230000 TO 201903292400]', limit=n.2019, batchsize=2000)
n.2019.2-length(df.2019.2$id)
n.2019.3 <- arxiv_count(query = 'submittedDate:[201903300000 TO 201904052400]')
df.2019.3 <- arxiv_search(query = 'submittedDate:[201903300000 TO 201904052400]', limit=n.2019, batchsize=2000)
n.2019.3-length(df.2019.3$id)
n.2019.4 <- arxiv_count(query = 'submittedDate:[201904060000 TO 201904122400]')
df.2019.4 <- arxiv_search(query = 'submittedDate:[201904060000 TO 201904122400]', limit=n.2019, batchsize=2000)
n.2019.4-length(df.2019.4$id)
n.2019.5 <- arxiv_count(query = 'submittedDate:[201904130000 TO 201904152400]')
df.2019.5 <- arxiv_search(query = 'submittedDate:[201904130000 TO 201904152400]', limit=n.2019, batchsize=2000)
n.2019.5-length(df.2019.5$id)
df.2019.full <- rbind(df.2019.1, df.2019.2, df.2019.3, df.2019.4, df.2019.5)
n.2019-length(df.2019.full$id)
write.csv(df.2019.full, file="Data/arxiv_2019_data.csv")
#Get all submissions between March 15, 2018 and April 15, 2018 (same dates two years ago)
n.2018 <- arxiv_count(query = 'submittedDate:[201803150000 TO 201804152400]')
n.2018.1 <- arxiv_count(query = 'submittedDate:[201803150000 TO 201803212400]')
df.2018.1 <- arxiv_search(query = 'submittedDate:[201803150000 TO 201803212400]', limit=n.2018, batchsize=2000)
n.2018.1-length(df.2018.1$id)
n.2018.2 <- arxiv_count(query = 'submittedDate:[201803220000 TO 201804032400]')
df.2018.2 <- arxiv_search(query = 'submittedDate:[201803220000 TO 201804032400]', limit=n.2018, batchsize=2000)
n.2018.2-length(df.2018.2$id)
n.2018.3 <- arxiv_count(query = 'submittedDate:[201804040000 TO 201804092400]')
df.2018.3 <- arxiv_search(query = 'submittedDate:[201804040000 TO 201804092400]', limit=n.2018, batchsize=2000)
n.2018.3-length(df.2018.3$id)
n.2018.4 <- arxiv_count(query = 'submittedDate:[201804100000 TO 201804152400]')
df.2018.4 <- arxiv_search(query = 'submittedDate:[201804100000 TO 201804152400]', limit=n.2018, batchsize=2000)
n.2018.4-length(df.2018.4$id)
df.2018.full <- rbind(df.2018.1, df.2018.2, df.2018.3, df.2018.4)
n.2018-length(df.2018.full$id)
write.csv(df.2018.full, file="Data/arxiv_2018_data.csv")
#Get all submissions between Jan. 1, 2020 and March 15, 2020 (before COVID-19 pandemic)
n.early2020 <- arxiv_count(query = 'submittedDate:[202001010000 TO 202003152400]')
n.early2020.1 <- arxiv_count(query = 'submittedDate:[202001010000 TO 202001152400]')
df.early2020.1 <- arxiv_search(query = 'submittedDate:[202001010000 TO 202001152400]', limit=n.early2020, batchsize=2000)
n.early2020.1 - length(df.early2020.1$id)
n.early2020.2 <- arxiv_count(query = 'submittedDate:[202001160000 TO 202001312400]')
df.early2020.2 <- arxiv_search(query = 'submittedDate:[202001160000 TO 202001312400]', limit=n.early2020, batchsize=2000)
n.early2020.2 - length(df.early2020.2$id)
n.early2020.3 <- arxiv_count(query = 'submittedDate:[202002010000 TO 202002152400]')
df.early2020.3 <- arxiv_search(query = 'submittedDate:[202002010000 TO 202002152400]', limit=n.early2020, batchsize=2000)
n.early2020.3 - length(df.early2020.3$id)
n.early2020.4 <- arxiv_count(query = 'submittedDate:[202002160000 TO 202002292400]')
df.early2020.4 <- arxiv_search(query = 'submittedDate:[202002160000 TO 202002292400]', limit=n.early2020, batchsize=2000)
n.early2020.4 - length(df.early2020.4$id)
n.early2020.5 <- arxiv_count(query = 'submittedDate:[202003010000 TO 202003152400]')
df.early2020.5 <- arxiv_search(query = 'submittedDate:[202003010000 TO 202003152400]', limit=n.early2020, batchsize=2000)
n.early2020.5 - length(df.early2020.5$id)
df.early2020.full <- rbind(df.early2020.1, df.early2020.2, df.early2020.3, df.early2020.4, df.early2020.5)
n.early2020-length(df.early2020.full$id)
write.csv(df.early2020.full, file="Data/arxiv_early2020_data.csv")
#Get all submissions between Apr. 16, 2020 and April 22, 2020 (update analysis with most recent data)
n.update <- arxiv_count(query = 'submittedDate:[202004160000 TO 202004222400]')
df.update <- arxiv_search(query = 'submittedDate:[202004160000 TO 202004222400]', limit=n.update, batchsize=2000)
n.update - length(df.update$id)
write.csv(df.update, file="Data/arxiv_update2020_data.csv")
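The weekly batches above were written out by hand. As a rough sketch only, the query strings could instead be generated programmatically. The helper `make_date_queries()` below is hypothetical (it is not part of aRxiv); it reuses the `submittedDate:[... TO ...]` syntax and the `0000`/`2400` time stamps from the code above, and assumes consecutive, non-overlapping windows of whole days.

```r
# Hypothetical helper (not part of aRxiv): build one 'submittedDate' range
# query per window of `by` days between `from` and `to` (inclusive).
# The "0000"/"2400" time stamps follow the convention used in the code above.
make_date_queries <- function(from, to, by = 7, end_time = "2400") {
  from <- as.Date(from)
  to <- as.Date(to)
  starts <- seq(from, to, by = by)      # First day of each window
  ends <- pmin(starts + (by - 1), to)   # Last day, capped at `to`
  sprintf("submittedDate:[%s0000 TO %s%s]",
          format(starts, "%Y%m%d"), format(ends, "%Y%m%d"), end_time)
}

queries <- make_date_queries("2020-03-15", "2020-04-15")
print(queries)
```

Each element of `queries` could then be passed to `arxiv_count()` and `arxiv_search()` in a loop, with the same check on the number of records returned.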
Next, I assigned gender to author names using the gender package, see:
Mullen, L. (2019). gender: Predict Gender from Names Using Historical Data. R package version 0.5.3, https://github.com/ropensci/gender.
This package returns the probability that a name is male or female by comparing the name to names in a database; I used the U.S. Social Security baby names database.
Please note: this is a brute force method of predicting gender, and it has many limitations, as discussed by the package authors on their GitHub repo and included links. By using this method, I am not assuming that individuals are correctly gendered in the resulting dataset, but merely that it provides insight into gender’s effects in aggregate across the population of preprint authors. This approach clearly mis-genders or excludes some individual authors, but it can reveal gender bias in a large enough dataset.
I predicted the genders of all preprint authors, and summarized the data as the number of male and female authors of each preprint, regardless of author order. This code takes a while to run, so it is not run when knitting this markdown document.
#Not run
library(gender) #Provides gender()
library(stringr) #Provides word() and str_count()
#First combine data for year-by-year comparison
df.2020 <- read.csv("Data/arxiv_2020_data.csv") #Read in data
df.2019 <- read.csv("Data/arxiv_2019_data.csv")
df.2018 <- read.csv("Data/arxiv_2018_data.csv")
df.full <- rbind(df.2018, df.2019, df.2020) #Combine in one dataframe
#Next combine data for 2020 comparison
df.early2020 <- read.csv("Data/arxiv_early2020_data.csv")
df.update <- read.csv("Data/arxiv_update2020_data.csv")
df.all2020 <- rbind(df.2020, df.early2020, df.update) #Combine in one dataframe
split.names <- function(x){strsplit(as.character(x), "|", fixed=TRUE)} #Write a function to split strings of author names
#For the year over year dataset
df.full$split.names <- lapply(df.full$authors, split.names) #Apply function
all_first_names <- word(unlist(df.full$split.names),1) #Extract the first (given) name of every author
gender <- gender(all_first_names, method = "ssa") #Assign gender
gender <- unique(gender[ , c(1,2,4)]) #Keep only unique names
#This loop is an inelegant way of counting the number of male and female authors for each paper
tmp <- NULL
for(i in seq_along(df.full$authors)){
  tmp <- as.data.frame(word(unlist(df.full$split.names[[i]]), 1)) #First names of this paper's authors
  colnames(tmp) <- "name"
  tmp <- merge(tmp, gender, by="name", all.x=TRUE, all.y=FALSE) #Look up predicted gender; unmatched names become NA
  df.full$male.n[i] <- sum(as.numeric(str_count(as.character(tmp$gender), pattern = "\\bmale\\b")), na.rm=TRUE) #\\b stops "male" from matching inside "female"
  df.full$female.n[i] <- sum(as.numeric(str_count(as.character(tmp$gender), pattern = "\\bfemale\\b")), na.rm=TRUE)
}
df.full.output <- as.data.frame(apply(df.full,2,as.character))
write.csv(df.full.output, "Data/arxiv_full_gender.csv")
#Same for the all 2020 dataset
df.all2020$split.names <- lapply(df.all2020$authors, split.names)
tmp <- NULL
all_first_names <- word(unlist(df.all2020$split.names),1)
gender <- gender(all_first_names, method = "ssa")
gender <- unique(gender[ , c(1,2,4)])
for(i in seq_along(df.all2020$authors)){
  tmp <- as.data.frame(word(unlist(df.all2020$split.names[[i]]), 1)) #First names of this paper's authors
  colnames(tmp) <- "name"
  tmp <- merge(tmp, gender, by="name", all.x=TRUE, all.y=FALSE) #Look up predicted gender; unmatched names become NA
  df.all2020$male.n[i] <- sum(as.numeric(str_count(as.character(tmp$gender), pattern = "\\bmale\\b")), na.rm=TRUE)
  df.all2020$female.n[i] <- sum(as.numeric(str_count(as.character(tmp$gender), pattern = "\\bfemale\\b")), na.rm=TRUE)
}
df.all2020.output <- as.data.frame(apply(df.all2020,2,as.character))
write.csv(df.all2020.output, "Data/arxiv_all2020_gender.csv")
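As a rough illustration of the counting step, here is a minimal base R sketch that tallies male and female authors per paper without an explicit loop. The toy `papers` data and the `lookup` vector are made up for this example; `lookup` merely stands in for the name-to-gender table returned by `gender()`, and `count_gender()` is a hypothetical helper. Names missing from the lookup (initials, or names absent from the database) are simply not counted, as in the code above.

```r
# Toy data: author strings separated by "|", as in the scraped arXiv records
papers <- data.frame(
  id = c("a", "b", "c"),
  authors = c("Maria Lopez|John Smith", "J. Doe", "Anna Berg|Emma Stone|Wei Zhang"),
  stringsAsFactors = FALSE
)

# Made-up lookup table standing in for the output of gender()
lookup <- c(Maria = "female", John = "male", Anna = "female", Emma = "female")

# Hypothetical helper: count the authors of one paper whose first name
# maps to `target` in the lookup table
count_gender <- function(author_string, target) {
  full_names <- strsplit(author_string, "|", fixed = TRUE)[[1]]
  first_names <- sub(" .*$", "", trimws(full_names)) # Keep text before the first space
  sum(lookup[first_names] == target, na.rm = TRUE)   # Unmatched names give NA and are dropped
}

papers$female.n <- vapply(papers$authors, count_gender, integer(1),
                          target = "female", USE.NAMES = FALSE)
papers$male.n <- vapply(papers$authors, count_gender, integer(1),
                        target = "male", USE.NAMES = FALSE)
print(papers)
```

In this toy example, "J." and "Wei" are not in the lookup table, so those authors are excluded from both counts.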
Next, I zoomed in on the months leading up to widespread stay-at-home orders and school and childcare closures that North Americans experienced beginning in late March or early April, 2020. (These measures were implemented to different degrees and on different dates in different parts of the world.)
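#All authors
library(lubridate) #floor_date() and year()
library(dplyr) #%>%, group_by(), summarize(), ungroup()
library(tidyr) #gather() and spread()
library(ggplot2) #Plotting
library(cowplot) #theme_cowplot(), plot_grid(), save_plot()
library(wesanderson) #wes_palette()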
df.all2020 <- read.csv("Data/arxiv_all2020_gender.csv") #Read in data
df.all2020 <- df.all2020[!duplicated(df.all2020), ] #Remove duplicated rows
df.all2020$month <- floor_date(as.Date(df.all2020$submitted), "month") #Bin by month
arxiv.m <- as.data.frame(ungroup(subset(df.all2020) %>% group_by(month) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE),n.days = length(unique(as.Date(submitted)))))) #Summarize by month
arxiv.m.long <- gather(arxiv.m, gender, n, female.n:male.n) #Make wide data long
arxiv.m.long$pubs.per.day <- arxiv.m.long$n/arxiv.m.long$n.days #Adjust for different numbers of days in each month
arxiv.m.long$gender <- as.factor(arxiv.m.long$gender) #Make sure gender is a factor
levels(arxiv.m.long$gender) <- c("Female", "Male") #Capitalize genders
p3 <- ggplot(data=arxiv.m.long, aes(fill=gender, y=pubs.per.day, x=month))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+ggtitle("arXiv: early 2020")+xlab("Month")+ylab("Preprint authors per day (no.)")+labs(fill="Gender")+facet_grid(~gender)+theme(legend.position="none", plot.title = element_text(hjust = 0.5))
p3
The numbers of male authors of arXiv preprints have increased through early 2020, while numbers of female authors of arXiv preprints have basically plateaued during the pandemic.
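Because the months in the dataset cover different numbers of days (April is only partly covered), the monthly totals are divided by the number of distinct submission dates in each month. The sketch below illustrates that normalization in base R on made-up numbers; the `toy` data frame is an assumption for illustration, not real arXiv data.

```r
# Made-up per-preprint author counts spanning two partial months
toy <- data.frame(
  submitted = as.Date(c("2020-03-30", "2020-03-31", "2020-03-31", "2020-04-01")),
  female.n = c(1, 0, 2, 1),
  male.n = c(2, 1, 3, 4)
)
toy$month <- format(toy$submitted, "%Y-%m") # Bin by month

# Total authors per month, by gender
per.day <- aggregate(cbind(female.n, male.n) ~ month, data = toy, FUN = sum)

# Number of distinct submission dates observed in each month
n.days <- tapply(toy$submitted, toy$month, function(d) length(unique(d)))
per.day$n.days <- as.vector(n.days[per.day$month])

# Authors per day, so that partial months are comparable with full ones
per.day$female.per.day <- per.day$female.n / per.day$n.days
per.day$male.per.day <- per.day$male.n / per.day$n.days
print(per.day)
```

Note that this divides by the number of days with at least one submission, which is what `length(unique(as.Date(submitted)))` does in the code above, rather than by the number of calendar days.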
Next, I scraped submission data from bioRxiv (https://biorxiv.org/), which is the main preprint server for biology. I used the rbiorxiv package, see:
Fraser, N. (2020). rbiorxiv. R package, https://github.com/nicholasmfraser/rbiorxiv
I scraped all records for March 15-April 15, 2020, during the COVID-19 pandemic, and for the same date range in 2019 and 2018. I then expanded to all of 2020, up to April 22, 2020.
#Not run
library(rbiorxiv) #Provides biorxiv_content()
#Get all submissions between Jan 1, 2020 and April 22, 2020
df.b.2020 <- biorxiv_content(from = "2020-01-01", to = "2020-04-22", limit = "*", format = "df")
#Get all submissions for March 15 to April 15, 2019
df.b.2019 <- biorxiv_content(from = "2019-03-15", to = "2019-04-15", limit = "*", format = "df")
#Get all submissions for March 15 to April 15, 2018
df.b.2018 <- biorxiv_content(from = "2018-03-15", to = "2018-04-15", limit = "*", format = "df")
write.csv(df.b.2020, "Data/biorxiv_2020_data.csv")
write.csv(df.b.2019, "Data/biorxiv_2019_data.csv")
write.csv(df.b.2018, "Data/biorxiv_2018_data.csv")
I then inferred the gender of corresponding authors of bioRxiv preprints, as above. Note that the bioRxiv API returns first (given) names only for corresponding authors, and not for all authors, so this part of the analysis is limited to corresponding authors. (Unfortunately.)
df.b.2018 <- read.csv("Data/biorxiv_2018_data.csv")
df.b.2019 <- read.csv("Data/biorxiv_2019_data.csv")
df.b.all2020 <- read.csv("Data/biorxiv_2020_data.csv")
df.b.full <- rbind(df.b.2018, df.b.2019, subset(df.b.all2020, as.Date(date) >= "2020-03-15" & as.Date(date) <= "2020-04-15"))
df.b.full$year <- as.factor(year(as.Date(df.b.full$date)))
df.b.all2020$date <- as.Date(df.b.all2020$date)
df.b.full$cor.author.first.name <- sapply(strsplit(as.character(df.b.full$author_corresponding), " "), head, 1) #Extract first names
df.b.all2020$cor.author.first.name <- sapply(strsplit(as.character(df.b.all2020$author_corresponding), " "), head, 1) #Extract first names
gender <- NULL
gender <- gender(df.b.full$cor.author.first.name, method = "ssa")
gender <- unique(gender[ , c(1,2,4)])
df.b.full <- merge(df.b.full, gender, by.x = "cor.author.first.name", by.y ="name", all = TRUE)
df.b.full <- df.b.full[!duplicated(df.b.full),]
gender <- NULL
gender <- gender(df.b.all2020$cor.author.first.name, method = "ssa")
gender <- unique(gender[ , c(1,2,4)])
df.b.all2020 <- merge(df.b.all2020, gender, by.x = "cor.author.first.name", by.y ="name", all = TRUE)
df.b.all2020 <- df.b.all2020[!duplicated(df.b.all2020),]
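The first-name extraction above takes the text before the first space in each corresponding author's name. As a hedged sketch, the hypothetical helper `first_name()` below does the same thing in base R but always returns a character vector, giving `NA` for empty or missing entries; the example names are invented. Initials such as "J." are returned as-is and would simply go unmatched in the names database.

```r
# Hypothetical helper: return the first whitespace-delimited token of each name,
# or NA when the entry is empty or missing
first_name <- function(x) {
  parts <- strsplit(trimws(as.character(x)), " +")
  vapply(parts, function(p) {
    if (length(p) == 0 || is.na(p[1]) || !nzchar(p[1])) NA_character_ else p[1]
  }, character(1))
}

# Invented examples, including an empty string and a missing value
first_name(c("Jane Q. Public", "", NA, "J. Smith"))
```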
I compared the numbers of male and female corresponding authors on bioRxiv preprints between March 15-April 15, 2019 and the same dates in 2020.
biorxiv.yr <- as.data.frame(ungroup(subset(df.b.full, !is.na(gender)) %>% group_by(year, gender) %>% summarize(n=n())))
biorxiv.yr.wide <- spread(biorxiv.yr, year, n)
biorxiv.yr.wide$gender <- as.factor(biorxiv.yr.wide$gender)
levels(biorxiv.yr.wide$gender) <- c("Female", "Male")
biorxiv.yr.long <- gather(biorxiv.yr.wide, year, number, `2018`:`2020`)
biorxiv.yr.long$year <- as.factor(biorxiv.yr.long$year)
biorxiv.yr.wide$per.dif.1920 <- (biorxiv.yr.wide$`2020`/biorxiv.yr.wide$`2019`)*100-100
p4 <- ggplot(data=subset(biorxiv.yr.long, year != 2018), aes(fill=as.factor(year), y=number, x=as.factor(gender)))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+xlab("Gender")+ylab("Authors (no.)")+labs(fill="Year")+scale_fill_manual(values = wes_palette("Royal1"))+ggtitle("bioRxiv: 2019 vs 2020")+theme(plot.title = element_text(hjust = 0.5))+annotate("text", x=c(1,2), y=c(1100,2650), label = paste0("+", round(biorxiv.yr.wide$per.dif.1920, 1), "%"))+theme(legend.position="none")
p4
p5 <- plot_grid(p1, p2, p4, nrow=1) #Combine into part of a single figure; p1 and p2 are not created in the code shown here and must already exist
The gender difference among corresponding authors for bioRxiv preprints is more modest than in the full arXiv dataset, but the number of male corresponding authors of bioRxiv preprints has still increased more than the number of female corresponding authors of bioRxiv preprints, year over year.
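The year-over-year comparison rests on a simple percent change, `(n_2020 / n_2019) * 100 - 100`, computed separately for each gender. The sketch below works through that arithmetic on invented counts (these are not the bioRxiv results); `percent_change()` is a hypothetical helper, and `sprintf("%+.1f%%", ...)` is one possible way to label the bars so that a decline would show a minus sign rather than a hard-coded "+".

```r
# Hypothetical helper: percent change from `before` to `after`
percent_change <- function(before, after) (after / before) * 100 - 100

# Invented counts of corresponding authors, for illustration only
counts <- data.frame(
  gender = c("Female", "Male"),
  `2019` = c(1000, 2000),
  `2020` = c(1100, 2500),
  check.names = FALSE
)

counts$per.dif.1920 <- percent_change(counts$`2019`, counts$`2020`)

# Signed labels: "+" for an increase, "-" for a decrease
labels <- sprintf("%+.1f%%", counts$per.dif.1920)
print(labels)

# A decline is labelled with a minus sign
print(sprintf("%+.1f%%", percent_change(200, 150)))
```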
As for arXiv submissions, I also compared bioRxiv submissions across months for early 2020.
df.b.all2020$month <- floor_date(df.b.all2020$date, "month")
biorxiv.m <- as.data.frame(ungroup(subset(df.b.all2020, !is.na(gender)) %>% group_by(month, gender) %>% summarize(n=n(), n.days = length(unique(date)), pubs.per.day = n/n.days)))
biorxiv.m$gender <- as.factor(biorxiv.m$gender)
levels(biorxiv.m$gender) <- c("Female", "Male")
p6 <- ggplot(data=biorxiv.m, aes(fill=gender, y=pubs.per.day, x=month))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+ggtitle("bioRxiv: early 2020")+xlab("Month")+ylab("Preprint authors per day (no.)")+facet_grid(~gender)+theme(legend.position="none", plot.title = element_text(hjust = 0.5))
p6
p7 <- plot_grid(p3, p6, nrow=1) #Combine into part of a single figure
The numbers of male authors of bioRxiv preprints have increased steadily through early 2020, while numbers of female authors of bioRxiv preprints have increased only slightly.
I put it all together in a single figure.
p8 <- plot_grid(p5, p7, nrow=2)
p8
save_plot("figure.png", p8, base_height=16, base_width=16)
Throughout this analysis, the estimated effects are likely conservative, because many preprints describe research completed months ago. However, the trends in both preprint servers are consistent with the hypothesis that the pandemic is disproportionately hurting the productivity of female scholars. How long this effect will persist, and what its downstream consequences might be for journal publications and academic careers, are open questions. In summary, in a ‘publish or perish’ world, it seems this pandemic could be setting back the hard-won progress of women in STEM.